
Handle cloning of objects larger than 4GB on Windows #2102

Open

dscho wants to merge 6 commits into gitgitgadget:master from dscho:fix-large-clones-on-windows

Conversation

Member

dscho commented Apr 28, 2026

On Windows, unsigned long is 32-bit even on 64-bit systems. This
causes multiple problems when Git handles objects larger than 4GB.
This patch series is a very targeted fix for a very early part of the
problem: it addresses the most fundamental truncation points that
prevent a >4GB object from surviving a clone at all.

Specifically, this fixes:

  • zlib's uLong wrapping and triggering BUG() assertions in the
    git_zstream wrapper
  • Object sizes being truncated in pack streaming, delta headers, and
    index-pack/unpack-objects
  • pack-objects re-encoding reused pack entries with a truncated
    size, producing corrupt packs on the wire

Many other code paths still use unsigned long for object sizes
(e.g., cat-file -s, object_info.sizep, the delta machinery) and
will need their own conversions. This series does not attempt to fix
those.

Based on work by @LordKiRon in git-for-windows#6076.

The last two commits add a test helper that synthesizes a pack with
a >4GB blob and regression tests that clone it via both the
unpack-objects and index-pack code paths using file://
transport.

cc: Derrick Stolee stolee@gmail.com
cc: Torsten Bögershausen tboegi@web.de
cc: Jeff King peff@peff.net

dscho added 6 commits April 28, 2026 01:13

When unpacking objects from a packfile, the object size is decoded
from a variable-length encoding. On platforms where unsigned long is
32-bit (such as Windows, even in 64-bit builds), the shift operation
overflows when decoding sizes larger than 4GB. The result is a
truncated size value, causing the unpacked object to be corrupted or
rejected.

Fix this by changing the size variable to size_t, which is 64-bit on
64-bit platforms, and ensuring the shift arithmetic occurs in 64-bit
space.

This was originally authored by LordKiRon <https://github.com/LordKiRon>,
who preferred not to reveal their real name and therefore agreed that I
take over authorship.

Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>

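The decoding in question is a 7-bit variable-length integer. A minimal sketch of the fixed arithmetic (illustrative names, not Git's exact `unpack_object_header_buffer()`):

```c
#include <assert.h>
#include <stddef.h>

/*
 * Illustrative sketch of pack object-header decoding: the first byte
 * carries the object type and the low 4 size bits; each continuation
 * byte adds 7 more bits. Doing the shift on a size_t (instead of a
 * 32-bit unsigned long) is what keeps sizes above 4GB intact.
 * Caller must guarantee the buffer holds a complete header.
 */
static size_t decode_object_header(const unsigned char **p, int *type)
{
	const unsigned char *buf = *p;
	unsigned char c = *buf++;
	size_t size = c & 0x0f;
	int shift = 4;

	*type = (c >> 4) & 0x07;
	while (c & 0x80) {
		c = *buf++;
		size += (size_t)(c & 0x7f) << shift;	/* 64-bit shift */
		shift += 7;
	}
	*p = buf;
	return size;
}
```

With a 32-bit `unsigned long`, the `<< shift` wraps as soon as `shift` reaches 32; the `(size_t)` cast forces the arithmetic into 64-bit space on 64-bit builds.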
On Windows, zlib's `uLong` type is 32-bit even on 64-bit systems. When
processing data streams larger than 4GB, the `total_in` and `total_out`
fields in zlib's `z_stream` structure wrap around, which caused the
sanity checks in `zlib_post_call()` to trigger `BUG()` assertions.

The git_zstream wrapper now tracks its own 64-bit totals rather than
copying them from zlib. The sanity checks compare only the low bits,
using `maximum_unsigned_value_of_type(uLong)` to mask appropriately for
the platform's `uLong` size.

This is based on work by LordKiRon in git-for-windows#6076.

Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>

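The masking idea can be sketched like this (a simplified stand-in; `ULONG32_MAX` plays the role of `maximum_unsigned_value_of_type(uLong)` on an LLP64 platform):

```c
#include <assert.h>
#include <stdint.h>

/*
 * Simplified sketch: `total` is the wrapper's own 64-bit running count,
 * `z_total` models zlib's uLong counter, which wraps at 2^32 on LLP64
 * systems. The sanity check only compares the bits zlib can actually
 * represent; ULONG32_MAX stands in for
 * maximum_unsigned_value_of_type(uLong) on such a platform.
 */
#define ULONG32_MAX 0xffffffffUL

static int totals_consistent(uint64_t total, uint32_t z_total)
{
	return (total & ULONG32_MAX) == ((uint64_t)z_total & ULONG32_MAX);
}
```

A 5GB stream makes the 32-bit counter wrap, yet the low-bits comparison still succeeds, which is exactly what stops the `BUG()` from firing.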
The odb_read_stream structure uses unsigned long for the size field,
which is 32-bit on Windows even in 64-bit builds. When streaming
objects larger than 4GB, the size would be truncated to zero or an
incorrect value, resulting in empty files being written to disk.

Change the size field in odb_read_stream to size_t and introduce
unpack_object_header_sz() to return sizes via size_t pointer. Since
object_info.sizep remains unsigned long for API compatibility, use
temporary variables where the types differ, with comments noting the
truncation limitation for code paths that still use unsigned long.

This was originally authored by LordKiRon <https://github.com/LordKiRon>,
who preferred not to reveal their real name and therefore agreed that I
take over authorship.

Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>

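The temporary-variable pattern can be illustrated as follows; here `uint32_t` stands in for Windows' 32-bit `unsigned long` so the truncation is visible on any platform, and the struct and function names are made up for the sketch:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Sketch of bridging a size_t-producing decoder into a legacy API field
 * that is effectively 32 bits wide on LLP64 (Windows). uint32_t models
 * that platform's unsigned long; names are illustrative, not Git's.
 */
struct legacy_info {
	uint32_t *sizep;	/* legacy field: truncates >4GB values */
};

static void fill_legacy_size(struct legacy_info *oi, size_t real_size)
{
	if (oi->sizep)
		/* NEEDSWORK: lossy down-cast for objects >4GB */
		*oi->sizep = (uint32_t)real_size;
}
```

Keeping the full value in a `size_t` temporary and down-casting only at the API boundary is what lets the internal code paths stay correct while the legacy field awaits its own conversion.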
The delta header decoding functions return unsigned long, which
truncates on Windows for objects larger than 4GB. Introduce size_t
variants get_delta_hdr_size_sz() and get_size_from_delta_sz() that
preserve the full 64-bit size, and use them in packed_object_info()
where the size is needed for streaming decisions.

This was originally authored by LordKiRon <https://github.com/LordKiRon>,
who preferred not to reveal their real name and therefore agreed that I
take over authorship.

Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>

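The shape of the change can be sketched as follows. The function names mirror the series, but the bodies are illustrative; the real wrapper uses `cast_size_t_to_ulong()`, which dies on overflow rather than clamping as done here:

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/*
 * Sketch of the pattern: implement the decoder once in 64-bit space,
 * keep the old unsigned-long entry point as a thin down-casting
 * wrapper. Bodies are illustrative, not Git's exact code.
 */
static size_t get_delta_hdr_size_sz(const unsigned char **datap,
				    const unsigned char *top)
{
	const unsigned char *data = *datap;
	size_t size = 0;
	int shift = 0;
	unsigned char cmd;

	do {
		cmd = *data++;
		size |= (size_t)(cmd & 0x7f) << shift;
		shift += 7;
	} while ((cmd & 0x80) && data < top);
	*datap = data;
	return size;
}

static unsigned long get_delta_hdr_size(const unsigned char **datap,
					const unsigned char *top)
{
	size_t size = get_delta_hdr_size_sz(datap, top);
	return size > ULONG_MAX ? ULONG_MAX : (unsigned long)size;
}
```

Callers that have been converted use the `_sz` variant directly; everyone else keeps the old signature and only pays the truncation cost they already had.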
To test Git's behavior with very large pack files, we need a way to
generate such files quickly.

A naive approach using only readily-available Git commands would take
over 10 hours for a 4GB pack file, which is prohibitive.

Side-stepping Git's machinery and actual zlib compression by writing
uncompressed content with the appropriate zlib header makes things
much faster. The fastest method using this approach generates many
small, unreachable blob objects and takes about 1.5 minutes for 4GB.
However, this cannot be used because we need to test git clone, which
requires a reachable commit history.

Generating many reachable commits with small, uncompressed blobs takes
about 4 minutes for 4GB. But this approach 1) does not reproduce the
issues we want to fix (which require individual objects larger than
4GB) and 2) is comparatively slow because of the many SHA-1
calculations.

The approach taken here generates a single large blob (filled with NUL
bytes), along with the trees and commits needed to make it reachable.
This takes about 2.5 minutes for 4.5GB, which is the fastest option
that produces a valid, clonable repository with an object large enough
to trigger the bugs we want to test.

Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>

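The "uncompressed content with the appropriate zlib header" trick amounts to emitting RFC 1951 *stored* blocks inside an RFC 1950 wrapper. A hedged, self-contained sketch (not the actual test-tool code):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Sketch of writing zlib-wrapped stored (uncompressed) deflate data,
 * which avoids running the compressor when synthesizing huge blobs.
 * Framing follows RFC 1950/1951. `out` must be large enough:
 * len + 5 bytes per 64KB block + 6 bytes of wrapper.
 */
static uint32_t adler32_update(uint32_t adler, const unsigned char *buf,
			       size_t len)
{
	uint32_t a = adler & 0xffff, b = adler >> 16;
	size_t i;

	for (i = 0; i < len; i++) {
		a = (a + buf[i]) % 65521;
		b = (b + a) % 65521;
	}
	return (b << 16) | a;
}

static size_t zlib_store(unsigned char *out, const unsigned char *in,
			 size_t len)
{
	size_t o = 0;
	uint32_t adler = 1;

	out[o++] = 0x78;	/* CMF: deflate, 32KB window */
	out[o++] = 0x01;	/* FLG: no dict; (CMF*256+FLG) % 31 == 0 */
	do {
		size_t chunk = len > 65535 ? 65535 : len;

		out[o++] = (chunk == len);	/* BFINAL, BTYPE=00 (stored) */
		out[o++] = chunk & 0xff;	/* LEN, little-endian */
		out[o++] = (chunk >> 8) & 0xff;
		out[o++] = ~chunk & 0xff;	/* NLEN = ~LEN */
		out[o++] = (~chunk >> 8) & 0xff;
		memcpy(out + o, in, chunk);
		adler = adler32_update(adler, in, chunk);
		o += chunk;
		in += chunk;
		len -= chunk;
	} while (len);
	out[o++] = adler >> 24;	/* Adler-32 trailer, big-endian */
	out[o++] = (adler >> 16) & 0xff;
	out[o++] = (adler >> 8) & 0xff;
	out[o++] = adler & 0xff;
	return o;
}
```

The output is a valid zlib stream that any inflater accepts, so the synthesized objects pass Git's consistency checks while costing only a memcpy per 64KB.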
The shift overflow bug in index-pack and unpack-objects caused incorrect
object size calculation when the encoded size required more than 32 bits
of shift. This would result in corrupted or failed unpacking of objects
larger than 4GB.

Add a test that creates a pack file containing a 4GB+ blob using the
new 'test-tool synthesize pack --reachable-large' command, then clones
the repository to verify the fix works correctly.

Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
@dscho dscho marked this pull request as ready for review April 28, 2026 13:02
Member Author

dscho commented Apr 28, 2026

/submit


gitgitgadget Bot commented Apr 28, 2026

Submitted as pull.2102.git.1777393580.gitgitgadget@gmail.com

To fetch this version into FETCH_HEAD:

git fetch https://github.com/gitgitgadget/git/ pr-2102/dscho/fix-large-clones-on-windows-v1

To fetch this version to local tag pr-2102/dscho/fix-large-clones-on-windows-v1:

git fetch --no-tags https://github.com/gitgitgadget/git/ tag pr-2102/dscho/fix-large-clones-on-windows-v1

Comment thread delta.h
@@ -86,8 +86,11 @@ void *patch_delta(const void *src_buf, unsigned long src_size,
* This must be called twice on the delta data buffer, first to get the

Derrick Stolee wrote on the Git mailing list (how to reply to this email):

On 4/28/26 12:26 PM, Johannes Schindelin via GitGitGadget wrote:
> From: Johannes Schindelin <johannes.schindelin@gmx.de>
> > The delta header decoding functions return unsigned long, which
> truncates on Windows for objects larger than 4GB. Introduce size_t
> variants get_delta_hdr_size_sz() and get_size_from_delta_sz() that
> preserve the full 64-bit size, and use them in packed_object_info()
> where the size is needed for streaming decisions.

> + * Size_t variant that doesn't truncate - use for >4GB objects on Windows.
> + */
> +static inline size_t get_delta_hdr_size_sz(const unsigned char **datap,
> +					   const unsigned char *top)
...
> +static inline unsigned long get_delta_hdr_size(const unsigned char **datap,
> +					       const unsigned char *top)
> +{
> +	size_t size = get_delta_hdr_size_sz(datap, top);
>   	return cast_size_t_to_ulong(size);
>   }

I like this trick to use the 64-bit implementation and only to
down-cast for API compatibility. This allows a more gradual
transition than if we replaced ulongs with size_ts everywhere
at once.

Thanks,
-Stolee


gitgitgadget Bot commented Apr 29, 2026

User Derrick Stolee <stolee@gmail.com> has been added to the cc: list.

Comment thread t/t5608-clone-2gb.sh
@@ -49,4 +49,41 @@ test_expect_success 'clone - with worktree, file:// protocol' '


Derrick Stolee wrote on the Git mailing list (how to reply to this email):

On 4/28/26 12:26 PM, Johannes Schindelin via GitGitGadget wrote:
> From: Johannes Schindelin <johannes.schindelin@gmx.de>
> > The shift overflow bug in index-pack and unpack-objects caused incorrect
> object size calculation when the encoded size required more than 32 bits
> of shift. This would result in corrupted or failed unpacking of objects
> larger than 4GB.
> > Add a test that creates a pack file containing a 4GB+ blob using the
> new 'test-tool synthesize pack --reachable-large' command, then clones
> the repository to verify the fix works correctly.

As mentioned in the previous patch, constructing this large packfile
takes ~4 minutes in CI pipelines. That's quite a lot to handle for
every CI run.

> +test_expect_success SIZE_T_IS_64BIT 'set up repo with >4GB object' '

Your prereq here prevents it from running on 32-bit builds, which is
good. However, I wonder if it would be worth also specifying these
tests as expensive. It's less likely that these layers will be touched
often, so it should be enough to run these on major occasions, such as
testing a release candidate.

(I also think it's appropriate to have these tests _not_ be marked
expensive in their original contribution to git-for-windows/git,
because the pull request build should prove that the tests work. And
maybe git-for-windows/git should keep them for every PR since that's
where the tests matter the most.)

I suppose this also is a question for Junio and our process for
validating releases. Do we have a certain cadence where we run the
expensive tests? What has been our threshold for hiding a test case
behind the expensive label?

Thanks,
-Stolee


Jeff King wrote on the Git mailing list (how to reply to this email):

On Wed, Apr 29, 2026 at 09:34:21AM -0400, Derrick Stolee wrote:

> As mentioned in the previous patch, constructing this large packfile
> takes ~4 minutes in CI pipelines. That's quite a lot to handle for
> every CI run.

And for local runs, too. ;) The test suite takes less than 90 seconds to
run on my laptop, but t5608 by itself takes 160 seconds. Even if running it in
parallel didn't slow down the rest of the suite (which is not true,
because it's obviously hogging a whole processor the whole time), that's
still almost doubling the run-time.

I'd also worry about assuming that the trash directory can hold 4+GB
(maybe 8GB+ since we clone it?), especially since many of us use ram
disks.

That said...

> > +test_expect_success SIZE_T_IS_64BIT 'set up repo with >4GB object' '
> 
> Your prereq here prevents it from running on 32-bit builds, which is
> good. However, I wonder if it would be worth also specifying these
> tests as expensive. It's less likely that these layers will be touched
> often, so it should be enough to run these on major occasions, such as
> testing a release candidate.

I think it is already skipped in most cases, because t5608 requires the
GIT_TEST_CLONE_2GB environment variable be set. Arguably it should just
be using EXPENSIVE, too, as I do not think there is much value in having
individual flags for all of the expensive tests. I think that test just
predates the modern prereq system entirely.

> I suppose this also is a question for Junio and our process for
> validating releases. Do we have a certain cadence where we run the
> expensive tests? What has been our threshold for hiding a test case
> behind the expensive label?

AFAIK the labeling of expensive things is mostly ad-hoc, and nobody is
systematically running them. Likewise for the t/perf tests, which are
super expensive but do (very occasionally) turn up interesting
regressions.

-Peff


Derrick Stolee wrote on the Git mailing list (how to reply to this email):

On 5/1/2026 2:38 AM, Jeff King wrote:
> On Wed, Apr 29, 2026 at 09:34:21AM -0400, Derrick Stolee wrote:

>>> +test_expect_success SIZE_T_IS_64BIT 'set up repo with >4GB object' '
>>
>> Your prereq here prevents it from running on 32-bit builds, which is
>> good. However, I wonder if it would be worth also specifying these
>> tests as expensive. It's less likely that these layers will be touched
>> often, so it should be enough to run these on major occasions, such as
>> testing a release candidate.
> 
> I think it is already skipped in most cases, because t5608 requires the
> GIT_TEST_CLONE_2GB environment variable be set. Arguably it should just
> be using EXPENSIVE, too, as I do not think there is much value in having
> individual flags for all of the expensive tests. I think that test just
> predates the modern prereq system entirely.

Thanks for the extra details here! That helps avoid the issues that I
was thinking about, but maybe doubling-down and adding EXPENSIVE is
still worth it.

>> I suppose this also is a question for Junio and our process for
>> validating releases. Do we have a certain cadence where we run the
>> expensive tests? What has been our threshold for hiding a test case
>> behind the expensive label?
> 
> AFAIK the labeling of expensive things is mostly ad-hoc, and nobody is
> systematically running them. Likewise for the t/perf tests, which are
> super expensive but do (very occasionally) turn up interesting
> regressions.

I used to be more diligent about running the performance tests myself
around release windows. The EXPENSIVE tests would also be good to do
on rc0. I will contemplate how to put this into my routine.

Thanks,
-Stolee


gitgitgadget Bot commented Apr 29, 2026

Derrick Stolee wrote on the Git mailing list (how to reply to this email):

On 4/28/26 12:26 PM, Johannes Schindelin via GitGitGadget wrote:
> On Windows, unsigned long is 32-bit even on 64-bit systems. This causes
> multiple problems when Git handles objects larger than 4GB. This patch
> series is a very targeted fix for a very early part of the problem: it
> addresses the most fundamental truncation points that prevent a >4GB object
> from surviving a clone at all.
> > Specifically, this fixes:
> >   * zlib's uLong wrapping and triggering BUG() assertions in the git_zstream
>     wrapper
>   * Object sizes being truncated in pack streaming, delta headers, and
>     index-pack/unpack-objects
>   * pack-objects re-encoding reused pack entries with a truncated size,
>     producing corrupt packs on the wire

I'm glad to see this progress in this direction. It's a big step!

> Many other code paths still use unsigned long for object sizes (e.g.,
> cat-file -s, object_info.sizep, the delta machinery) and will need their own
> conversions. This series does not attempt to fix those.

I appreciate the mechanisms used to keep the scope of change minimal. This
differs from the typical "replace all 'unsigned long's with 'size_t'"
proposals in some clever ways.

> Based on work by @LordKiRon in git-for-windows/git#6076.
> > The last two commits add a test helper that synthesizes a pack with a >4GB
> blob and regression tests that clone it via both the unpack-objects and
> index-pack code paths using file:// transport.

My biggest concern here is about how expensive this test is. I wonder if
we should mark it as expensive for the core project and leave it enabled by
default in git-for-windows/git.

Thanks,
-Stolee

Comment thread builtin/index-pack.c
@@ -37,7 +37,7 @@ static const char index_pack_usage[] =


Torsten Bögershausen wrote on the Git mailing list (how to reply to this email):

On Tue, Apr 28, 2026 at 04:26:15PM +0000, Johannes Schindelin via GitGitGadget wrote:
> From: Johannes Schindelin <johannes.schindelin@gmx.de>
> 
> When unpacking objects from a packfile, the object size is decoded
> from a variable-length encoding. On platforms where unsigned long is
> 32-bit (such as Windows, even in 64-bit builds), the shift operation
> overflows when decoding sizes larger than 4GB. The result is a
> truncated size value, causing the unpacked object to be corrupted or
> rejected.
> 
> Fix this by changing the size variable to size_t, which is 64-bit on
> 64-bit platforms, and ensuring the shift arithmetic occurs in 64-bit
> space.
> 
> This was originally authored by LordKiRon <https://github.com/LordKiRon>,
> who preferred not to reveal their real name and therefore agreed that I
> take over authorship.

Good to see things moving forward.

See even
https://github.com/git-for-windows/git/pull/2179
which is probably obsolete soon.


gitgitgadget Bot commented Apr 30, 2026

User Torsten Bögershausen <tboegi@web.de> has been added to the cc: list.


gitgitgadget Bot commented May 1, 2026

User Jeff King <peff@peff.net> has been added to the cc: list.
